Genetic analyses and dispersal patterns unveil the Amazonian origin of guava domestication

Guava (Psidium guajava L.) is a semi-domesticated fruit tree of moderate importance in the Neotropics, utilized for millennia due to its nutritional and medicinal benefits, but its origin of domestication remains unknown. In this study, we examine genetic diversity and population structure in 215 plants from 11 countries in Mesoamerica, the Andes, and Amazonia using 25 nuclear microsatellite loci to propose an origin of domestication. Genetic analyses reveal one gene pool in Mesoamerica (Mexico) and four in South America (Brazilian Amazonia, Peruvian Amazonia and Andes, and Colombia), indicating greater differentiation among localities, possibly due to isolation between guava populations, particularly in the Amazonian and Andean regions. Moreover, Mesoamerican populations show high genetic diversity, with moderate genetic structure due to gene flow from northern South American populations. Dispersal scenarios suggest that Brazilian Amazonia is the probable origin of guava domestication, spreading from there to the Peruvian Andes, northern South America, Central America, and Mexico. These findings present the first evidence of guava domestication in the Americas, contributing to a deeper understanding of its evolutionary history.


Genetic diversity and genetic differentiation
The PCA provided evidence of genetic structure of guava across its geographical range.The first two principal components explained 18.5% of the total variation (Fig. 1a).Amazonian guavas from Brazil and Peru (BRA-AM and PER-AM, respectively) formed well-defined clusters.In contrast, guava samples from Colombia (COL) and Venezuela (VEN) overlap with those from the Antilles (ANT), documenting the close relationship between these regions.Given that the centroids of the Colombian and Venezuelan clusters do not co-occur within their respective standard deviation ellipses, we decided to define the Colombian (COL) and Venezuela-Antilles (VEN-ANT) clusters separately.Similarly, we defined a Peruvian Andes (PER-AND) cluster, Central American (CenAme) cluster and a Mexican (MEX) cluster.We decided to discard the samples from southern Brazil (BRA-SP), because the origin of these samples is uncertain (Fig. 1a).Therefore, for subsequent analyses, we used 192 samples.
The DAPC analysis with 7 defined groups from the results of the PCA revealed a clear differentiation of guavas from Brazilian and Peruvian Amazonia and the Peruvian Andes (Fig. 1b).These results were consistent with the patterns obtained using PCA but with far clearer differentiations among clusters.Likewise, the Mesoamerican (MEX + CenAme) and the northern South American clusters (COL + VEN-ANT) appeared as admixed groups (Fig. 1b).We performed a second DAPC analysis excluding the Peruvian and Brazilian well-differentiated groups to depict the relationships among Mesoamerican and northern South American groups.This analysis showed that VEN-ANT cluster was well differentiated from COL, CenAme and MEX, which are more closely related (Supplementary Fig. 2).
The Mexican and Central American localities showed the highest H E values, while the lowest H E values were found in the Peruvian Andes and Peruvian Amazonia (Table 1).When considering all the samples, guava showed high values of total and unbiased expected heterozygosity, and low values of average observed heterozygosity (H O ) (Table 1).The results for each locality maintain this pattern.The inbreeding coefficient showed high values for most of the localities (Table 1).
The results of the STRU CTU RE analysis were consistent with the results from PCA and DAPC.Evanno and Jane's methods indicated an optimal value of K = 3 and K = 5 as the most likely numbers of genetic clusters (Supplementary Fig. 3).In K = 3, three genetic clusters are probable.Cluster 1 is predominant in Mexico, strongly represented in Central America, moderately represented in Colombia, and a minor component of Venezuela and the Antilles (Fig. 2).Cluster is 2 predominant in Brazilian Amazonia and a relatively minor component of all other localities.The Peruvian Andes, the only locality with 100% of the plants fully attributed to cluster 3 (Fig. 2), was strongly represented in Peruvian Amazonia, Colombia, Venezuela, and the Antilles, and present in Brazilian Amazonia, Central America, and Mexico.At K = 4, although it is not an optimal value according to Evanno and Jane's methods, a new genetic cluster appears in Colombia, foreshadowing K = 5 (Fig. 2).In K = 5, each of the five clusters is dominant in a separate locality: (1) Peruvian Amazonia; (2) Brazilian Amazonia; (3)  Peruvian Andes; (4) Colombia; (5) Mexico.Venezuela and the Antilles are mixtures of Peruvian Amazonia and Colombia, while Central America is a mixture of Peruvian Amazonia, Colombia, and Mexico (Fig. 2).
This sequence of clusters from K = 3 to K = 5 suggests that western South America (BRA-AM → PER-AM → PER-AND) contains the origin of domesticated guava, given the number of clusters dominated by distinct genetic groups.From western South America, guava was then dispersed northward through Colombia to Central America and Mexico.Colombia is also the crossroads for the guava that arrived in Venezuela and later the Antilles.The fact that Mexico is always a clear cluster suggests that dispersal happened long enough ago for the thorough mixing of origins that became a distinct genetic group.
Estimates of Wright's F among the sampling localities indicated that guava diversity is higher within localities than among localities; nevertheless, we found intermediate levels of genetic differentiation with F ST = 0.21 (Supplementary Table 3).F IT (0.64) and F IS (0.54) estimates were higher compared to F ST (Supplementary Table 3).All fixation indexes were statistically significant.These results suggest that the frequency of heterozygotes is lower than expected under HWE.The pairwise F ST values suggested moderate to high genetic differentiation between localities (Fig. 3).The largest difference was observed for the Peruvian Andes, Peruvian Amazonia, and Brazilian Amazonia.Mesoamerica (MEX and CenAme) and northern South America (COL and VEN-ANT) showed less pairwise genetic differentiation (Fig. 3).
The Mantel test revealed no significant correlation between genetic and geographic distance matrices (R 2 = − 0.287, p = 0.779), indicating a lack of isolation by distance.We found that PER-AND, PER-AM, and BRA-AM localities, which are geographically closer to each other, are genetically less similar (Supplementary Fig. 4; Supplementary Table    www.nature.com/scientificreports/ between MEX, CenAme, VEN, and COL (Supplementary Tables 4-5).According to AMOVA, the variation among samples within localities (41%, Φ = 0.52) is higher than between localities (21.04%,Φ = 0.62; Table 2).Approximate Bayesian Computation (ABC) analyses indicated that the best supported dispersal hypothesis was scenario 2 (Fig. 4) with posterior probability of 0.999 and non-overlapping confidence intervals.This scenario showed low type I and type II error rates (0.00013; Supplementary Table 6), suggesting that domestication started in South America, specifically in Brazilian Amazonia (Brazil-AM), with dissemination to Mexico via Peruvian Amazonia (PER-AMA) and northern South America (COL).

Discussion
Perennial trees, characterized by their extended lifespan and delayed sexual reproduction, tend to exhibit a weak population structure 35,36 .Domestication profoundly influences the population dynamics and genetic structure of species, shaped by a complex set of evolutionary events involving both natural factors and ancestral and contemporary human activities.The present study provides the first evidence of the domestication history of guava in the Neotropics and assesses scenarios of the species' dispersal in the region.
Our analyses found that values of guava genetic diversity expressed as H E ranged from 0.44 to 0.70.These levels are comparable to those of other perennials with populations domesticated to some degree, such as A. cherimola, Olea europaea L., and Prunus armeniaca L. 9,37,38 .However, compared to other Psidium species, P. guajava has high genetic diversity, for example, in a single population in Southeast Brazil the maximum H E values (H E = 0.71) are comparable to those of P. guineense Sw. (H E = 0.74) and P. macahense O. Berg (H E = 0.63) 39 .In contrast, an insular Psidium species (P.galapageium Hook.f.) has moderate to low (H E = 0.275-0.570)genetic diversity 40 .P. cattleianum Afzel.ex Sabine also showed lower diversity values (H E = 0.117-0.326) 41.
Observed heterozygosity (Ho) systematically showed lower values than expected heterozygosity, suggesting heterozygote deficiency among guava due to inbreeding.In different islands of the Galapagos 42 and guava samples from a germplasm bank 43 registered similar results.These findings may be explained, in part, by self-fertilization and vegetative propagation, which can occur in P. guajava 44 .Likewise, Robertson's 45 hypotheses could be useful  www.nature.com/scientificreports/ to explain the heterozygosity decline in guava.He proposes that subdividing a population into several isolated groups would allow maximum genetic diversity (minimum global co-ancestry) to be achieved in the long term since different allelic variants will develop and become fixed in each group, becoming a genetic reservoir of variation.However, complete isolation leads to higher rates of local inbreeding with the possible consequence of inbreeding depression.Therefore, Robertson 45 also suggests that occasional mixing of these subpopulations would minimize the overall rate of inbreeding.In support of this hypothesis, we found lower rates of global inbreeding for guava.Likewise, Mantel's analysis suggests long-distance gene flow, especially between the populations of northern South America (COL and VEN) and Mesoamerica (MEX and CenAme; Supplementary Tables 4-5), which would allow the reduction of inbreeding and its effects.Overall, the absence of isolation by distance, the broad range of F ST values, the high F IS value, and the separate gene pools indicated by PCA, DAPC, and STRU CTU RE suggest a metapopulation dynamic.Local cultivated guava populations may originate from the surrounding genetic variation and occasionally receive long-distance gene flow.Finally, further studies are needed to examine the cause of heterozygote deficiency in guava.Contrary to our hypothesis of south-to-north decrease in diversity, the genetic diversity pattern, expressed as H E is just the opposite.A decreasing trend in H E was observed from Mexico and Central America (H E = 0.70) to the Peruvian Andes (PER-AND = 0.44).This pattern can be explained by the mixture of guavas from different regions occurring in Central America and Mexico.In these areas, most of the individuals show signatures of admixture of well-defined South American genetic groups.Hence, anthropic dispersal may have enhanced guava genetic diversity in Central America and Mexico.In addition, the diversity of environmental conditions, new biotic interactions, and selection pressures in Mesoamerica could have contributed to the maintenance of genetic variants that were present in the gene pool due to mutations.These events would explain an increased guava genetic diversity in response to new environmental conditions and challenges, a hypothesis that is testable by using ecological niche models [46][47][48] .In addition, whether this pattern points towards a center of genetic diversity or is the result of admixture among clusters is a matter to be evaluated by rating explicit demographic scenarios.
In our study, the genetic differentiation of P. guajava populations yielded an F ST value of 0.207, indicating moderate differentiation, considering the wide geographical range across which the species is distributed.Likewise, the molecular variance is higher among individuals/within localities than among localities.Similar findings have been reported for other perennial fruit trees like A. cherimola, Diospyros kaki L.F., Juglans regia L., Mangifera indica L., O. europaea, P. persica L., and P. armeniaca 9,37,38,[49][50][51][52] .In the case of guava, the observed F ST value may be attributed to limited gene flow among the sampled localities, which span different regions of the Neotropics.Similarly, the pattern of variance identified here can be due to outcrossing and guava's invasive (successional) character (Hamrick et al. 53 and therein).In cultivated and invasive populations of guava, the genetic variation pattern is also similar, with higher genetic variance among individuals/between populations and clearly defined genetic groups 42,54,55 .
Regarding the genetic clustering found in our study, each of the most geographically isolated populations from South America (Peruvian Andes [PER-AND], and Brazilian and Peruvian Amazonia [BRA-AM; PER-AMA]) belongs to a distinct genetic group and shows greater differentiation in relation to other groups (F ST ; Fig. 3).Localities in northern South America, Central America, and Mesoamerica show lower values of genetic differentiation with some individuals being admixed.This scenario suggests a pattern of greater differentiation among localities in South America, probably due to the isolation between guava populations, especially between the Amazonian and Andean regions.In Amazonia, the vast expanse of tropical rainforest and relatively homogeneous climatic conditions have favored the domestication of a variety of crops such as Manihot esculenta Crantz, T. cacao, and various fruits and nuts.This region is been an independent center of plant domestication, where indigenous peoples have managed and cultivated numerous crops over millennia, resulting in notable genetic diversity within these crops 1,6,27,56,57 .In contrast, the Andean ecosystems' altitudinal and climatic variability has led to the genetic differentiation of plants adapted to specific microenvironments.This environmental diversity has promoted the evolution of plants with unique genetic traits necessary for surviving extreme conditions, resulting in a mosaic of locally adapted crops, each with distinct genetic variations 1,6,27,56,57 .Moreover, human interaction with the environment in both regions has played a crucial role.In Amazonia, landscape management practices, such as the creation of "terra preta" (Amazonian Dark Earths), have enriched the soil and fostered crop diversification.In the Andes, agricultural techniques such as terracing and irrigation, have enabled the adaptation and cultivation of plants on steep slopes and less fertile soils 1,57,58 .
Therefore, the localities evaluated here would have likely been exposed to specific evolutionary processes, considering the climatic and ecological characteristics of their geographical origin's setting, thus promoting differentiation between them.Variable admixture levels among populations may also be the outcome of diverse trade routes and human migrations over time 30 , as is the case of J. regia 50 and M. indica 51 .
According to the best-supported ABC scenario, Amazonia is the most probable area of domestication of guava, and the first dispersal route likely was from there towards the Peruvian Andes.This result agrees with the oldest archaeological guava macro remains found in Southwestern Amazonia in the Teotonio archeological site, in a layer between 9490 and 6505 cal.BP 27 .In South America, the lowlands of southwestern Amazonia are recognized as a relevant center of domestication 1,27,56,57 and the place from which important species, such as manioc and peanut (Arachis hypogaea L.), dispersed towards the Peruvian dry coast 58 .Indeed, a significant number of archaeological guava remains dating from 6975 to 450 BP have been reported from the Peruvian dry coast (see Fig. 4 in Arévalo-Marín et al. 30 ).This evidence also supports the hypothesis that guava could have spread through the Andes, from Amazonia to the Peruvian coast, as the best supported scenario in this study suggests.Therefore, more detailed archaeological and genetic studies that include samples from both the southwestern region and other areas of Amazonia would allow for a confirmation of the domestication area of guava.
In summary, our study provides an overview of the genetic diversity and population structure of guava in the Neotropics.The microsatellite markers and Bayesian clustering approaches identified the presence of one www.nature.com/scientificreports/gene pool in Mesoamerican (Mexico) and four in South America (Brazilian Amazonia, Peruvian Amazonia and Andes, and Colombia).The high genetic differentiation between the Brazilian and Peruvian Amazonia and Peruvian Andes guava samples could be due to environmental differences, since guava subpopulations in distinct geographic settings may reflect divergent local adaptation.Niche analyses are needed to understand whether climatic events could explain these hypotheses, and genomic analyses would allow testing of hypotheses of local adaptation.On the other hand, the ABC approach identified Brazilian Amazonia as the potential area of guava domestication, with subsequent dispersal into western and northern South America and Mesoamerica where local diversification processes occurring in these last two regions could also underlie the observed diversity patterns.Follow-up studies that include defined populations of feral and cultivated guavas, and focused sampling in southwestern Amazonia and the Andes could help to unravel the guava domestication process.

Material
We studied 215 guava plants from 11  We also included samples collected outside germplasm banks from Brazilian Amazonia (38 plants), Peruvian Amazonia (15 plants), the Peruvian Andes (19 plants), and 14 samples from different localities in Venezuela.We considered samples collected in the Peruvian Andes and Brazilian Amazonia as tolerated or planted because these were collected in empty lots, roadsides, orchards, and gardens.Samples from Venezuela and Peruvian Amazonia were collected in areas far from plantations or crops; however, since it is difficult to distinguish between wild and feral guavas, each of these samples were considered feral.Because of this collection strategy, which is commonly used with cultivated plants, we are not dealing with biological populations, so we will call groups of plants from different areas "localities".All methods were performed in accordance with the relevant guidelines and regulations, and appropriate permissions for the collection of plant material were obtained from all relevant parties.

Molecular methods
DNA was extracted from young leaves using a CTAB-based protocol 59 .Initially, all 215 individuals were genotyped using 25 nuclear microsatellite loci developed for P. guajava 60,61 .We combined five primers in each of the five multiplex reactions (see Supplementary Table 7 for primers and multiplex reaction details).PCRs were performed using the Platinum Multiplex PCR Master Mix (Thermo-Fisher, USA) following the manufacturer's instructions for reaction assembly and program.Every reaction was driven to a 5.5 μL final volume containing 2.0 μL Platinum Multiplex PCR Master Mix, 2.0 μL PCR grade H 2 O, 0.5 μL G/C enhancer volume, 1.0 μL DNA template (50-200 ng/μL), and primer concentrations between 50 to 70 nM according to each product's relative fluorescent units (RFU).Multiplex reactions required an annealing temperature of 55 °C for all primers; 40 cycles were used in every PCR reaction.When amplification was not successful, we repeated the PCR reactions using 0.04 μL Kapa polymerase (Kapa Taq HotStart), 2.0 μL Buffer Kapa, 2.0 μL PCR grade H 2 O, and 1.0 μL DNA template (50-200 ng/μL).The annealing temperature and the number of cycles were maintained.To prevent possible contamination, we used negative controls for each multiplex assembly.All products were verified in 2% agarose gel electrophoresis.PCRs were carried out in a MultiGene OptiMax (Labnet International, Inc., Edison, NJ, USA) or in a 2700 thermal cycler (Applied Biosystems, Foster City, CA, USA).Genotyping was achieved using the Microsatellite plugin (v.1.4.7) of Geneious Prime 2022 (Dotmatics, NZ).Allele scoring was performed manually following Selkoe and Toonen 62 .

Null alleles, Hardy-Weinberg equilibrium, and linkage disequilibrium tests
We tested the presence and frequency of null alleles following Brookfield 63 using the PopGenReport v.3.0.7 package 64 in R. We calculated deviations from Hardy-Weinberg equilibrium (HWE) for each locus and separately for each locality.Also, we calculated HWE across all samples using the 'hw.test' function of the R package pegas v.1.1 65 , with 1000 Monte Carlo permutations.Alpha levels to determine statistically significant deviations from Hardy-Weinberg proportions and independent sorting of genotypes were adjusted using the false discovery rate (FDR) approach developed by Benjamini and Hochberg 66 , using 0.05 alpha level.P-values were corrected for multiple comparisons using the Benjamini-Hochberg method 66 .We calculated a measure of correlation (r̄d 67 using the function 'ia' 68 in the R package poppr v.2.9.3, for testing overall linkage disequilibrium.Using the function 'genotype curve' of the same package, we described the genotypic diversity in relation to different combinations of loci by a genotype accumulation curve to determine whether our sample provided a reasonable estimate of genetic diversity.The curve was generated by randomly sampling x loci and counting the number of multilocus genotypes (MLG) observed.This sampling was repeated r times from 1 to n − 1 loci, creating n − 1 distributions of observed MLGs 69 .

Genetic diversity and genetic differentiation
Genetic differentiation was examined using several complementary approaches.First, as an exploratory method, we performed a Principal Components Analysis (PCA) to summarize the genetic variation based on the microsatellite data set.Subsequently, we performed a Discriminant Analysis of Principal Components (DAPC) 70 .DAPC is an approach that optimizes the separation of individuals into predefined groups using a discriminant function of the principal component 70 .Based on DAPC, the membership probability was calculated for the overall genetic background of an individual.We used the components identified in the PCA analysis as predefined groups for the DAPC implementation.For implementing the PCA, we used the 'dudi.pca'function from ade4 v.1.7-22R 71 and visualized it with factoextra v.1.0.7 R 72 .For DAPC, we used adegenet v.2.2.10 73 implemented in R.
We assessed standard measures of genetic diversity for the entire dataset and genetic groups according to DAPC results.The number of individuals (N), number of alleles (A), and the expected (H E ) and observed (H O ) heterozygosities were calculated using poppr v.2.9.3 68 in R. We estimated rarefied allelic richness using the 'allel.rich' function of PopGenReport v.3.0.7 64 in R. Private allele richness (AP) were calculated using a rarefaction approach 74,75 implemented in ADZE 1.0 76 .
As an additional test to calculate the group assignment probability for each sample, we performed genetic population structure analysis using the Bayesian approach implemented in STRU CTU RE 2.3.4 77,78 , based on the admixture model with correlated allele frequencies and information on the origin of localities (popinfo = 1).The admixture model was tested for K-values ranging from 1 to 8, since 8 is the number of sampled regions, with 10 independent runs per K value for the entire dataset.We used 1,000,000 Markov Chain Monte Carlo iterations with a burn-in length of 100,000.To determine the most probable value of K, we used Evanno's ΔK method 79 and mean LnP(K) 80 implemented in Structure Harvester v0.6.94 81 .We used CLUMPP 1.1.2 82with the Greedy algorithm to infer the optimal K-cluster affiliations of samples and StructuRly v.0.1.0 83in R to generate bar graphs of the STRU CTU RE software results.
Wright's F statistics 84 (F IS , F IT , and F ST ) were estimated using the methods of Weir and Cockerham 85 .We also calculated the genetic differentiation among localities through a pairwise F ST matrix.Both the F statistics and the paired F ST matrix were calculated with 95% confidence intervals from 10,000 bootstrapping, using the 'diffCalc' function of diveRsity 86 .A Mantel test 87 was used to assess isolation by distance (IBD) between pairs of guava localities.We used the geographic distance matrix transform from coordinates in Euclidean distance and calculated using the function 'dist' in the stats v.4.3.1 in R and a linearized pairwise F ST matrix (F ST /1 − F ST ) as genetic distance.The function 'mantel.rtest'from ade4 v.1.7-22 88was used to calculate the Mantel test, and scatter-plots were then generated with adegenet v.2.2.10 73 .
We also tested the degree of genetic differentiation between DAPC groups (determined here) and locations, performing the analysis of molecular variance (AMOVA) followed by an estimation of the extent of genetic differentiation with phi-statistics, both using the 'poppr.amova'function in poppr v.2.9.3 68 .The significance of variance components was assessed using a permutation test implemented through the 'randtest' function in ape4 v.5.7 71,89,90 with 999 permutations.

Identification of the origin of domestication
We used nuclear microsatellite data to run the Approximate Bayesian Computation (ABC) framework 91,92 implemented in DIYABC-RF GUI 93 to model a possible branching order among guava localities that would represent the history of domestication of the lineage.We considered five scenarios (1) Mexico as a probable center of origin of domestication with dissemination to South America; (2) South America, specifically Brazilian Amazonia (Brazil-AM) as a probable center and later dissemination to Mexico via Peru; (3) Two independent centers of origin of domestication, one in the Peruvian Andes (Peru-An) and another in Mexico; (4) Peruvian Amazonia (Peru-AM) and Brazil-AM as independents centers of origin of domestication and dissemination towards northern South America with Central America and Mexico being of admixed destination; and (5) domestication in northern South America and dissemination to three areas (Mexico and Central America, Venezuela and Antilles, and Peru and Brazil) (Supplementary Fig. 5).The priors and conditions for each parameter can be found in the Supplementary Material 2; we considered a generation time of 10 years (probable fruiting time in natural conditions) 44 .We conducted previous runs to adjust the tested scenarios and the parameters 94 .For the final run, we obtained 500,000 simulated datasets, 500 trees, and 424 summary statistics.To identify the best supported scenario, we performed model check based on 500 pseudo-observed data sets (PODs) under each scenario to assess confidence in scenario choice, and to estimate the class specific error rates, which is the mean classification error rate 93,95,96

4
).The observed low pairwise F ST values indicates possible long-distance gene flow

Figure 2 .
Figure 2. Assignment probabilities of each of the 192 guava samples to each cluster inferred by STRU CTU RE for K = 3, 4, and 5.Each sample is represented by a vertical bar, and color indicates the probability of belonging to each cluster.Samples are ordered according to the geographic region from southern to northern parts of the Americas.

Figure 3 .
Figure 3. F ST genetic differentiation values among the 192 samples of guava grouped by localities.

Table 1 .
Genetic diversity obtained with 24 nuclear microsatellite loci for 192 plants of P. guajava from seven localities.N Number of individuals, A number of alleles, Ar rarefied allelic richness, P Ar number of private alleles, H O observed heterozygosity, H E expected heterozygosity, uH E unbiased heterozygosity, F fixation index.

Table 2 .
Results of the analysis of molecular variance (AMOVA) testing for differentiation among localities in P. guajava.
dfSum of squares % variance Φ-statistic p-value countries.We collected 86 samples from Brazil, Colombia, Honduras, Mexico, and Venezuela in the guava germplasm bank of the Instituto Nacional de Investigaciones Forestales, Agrícolas y Pecuarias (INIFAP) in Aguascalientes, Mexico; 17 samples from the guava collection of the Tropical Agricultural Research and Higher Education Center (CATIE) in Turrialba, Costa Rica, from Costa Rica, El Salvador, Guatemala, and Honduras; and 26 samples from Brazil, Colombia, and Puerto Rico in the guava collection of the Corporación Colombiana de Investigación Agropecuaria (Agrosavia) in Palmira, Colombia.